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ABSTRACT 

We present a new N-body and gas dynamics code, called Nyx, for large-scale 
cosmological simulations. Nyx follows the temporal evolution of a system of dis- 
crete dark matter particles gravitationally coupled to an inviscid ideal fluid in an 
expanding universe. The gas is advanced in an Eulerian framework with block- 
structured adaptive mesh refinement (AMR); a particle-mesh (PM) scheme using 
the same grid hierarchy is used to solve for self-gravity and advance the parti- 
cles. Computational results demonstrating the validation of Nyx on standard 
cosmological test problems, and the scaling behavior of Nyx to 50,000 cores, are 
presented. 

1. Introduction 

Understanding how the distribution of matter in the Universe evolves throughout cosmic 
history is one of the major goals of cosmology. Properties of the large-scale structure depend 
strongly on a cosmological model, as well as the initial conditions in that model. Thus, in 
principle, we can determine the cosmological parameters of our Universe by matching the 
observed distribution of matter to the best-fit calculated model universe. This is much easier 
said then done. Observations, with the exception of gravitational lensing, cannot directly 
access the dominant form of matter, the dark matter. Instead they reveal information about 
baryons, which in some biased way trace the dark m atter. On the theory side, equations 



describing the evolution of density perturbations (e.g. iPeeblesI (Il980[ )) can be easily solved 
only when the amplitude of perturbations is small. However, in most systems of interest, the 
local matter density, p, can be several orders of magnitude greater than the average density, 
Po, making the density contrast, 5 = (p — po)/po, much greater then unity. 
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Due to this non-linearity, and the uncertainty in the initial conditions of the Uni- 
verse, a single "heroic" simulation is not sufficient to resolve most of the cosmological 
questions of interest; the ability to run many simulations of different cosmologies, with 
different approximations t o the governing physics, is imperative. Semi-analytic models (for 



exa mple IWhite fc Frenkl (119911 ). in the context of galaxy formation) or scaling relations 
(e.g. ISmith et all d2003h ) present an interesting alternative to expensive simulations, but 
they must also be validated or calibrated by simulations in order to be trustworthy. Finally, 
besides being an invaluable theorist's tool, simulations play a crucial role in bridging the gap 
between theory and observation, something that is becoming more and more important as 
large sky surveys are starting to collect data. 

To match the capabilities of future galaxy/cluster surveys, cosmological simulations 
must be able to simulate volumes ~1 Gpc on a side while resolving scales on the order of 
~10 kpc or even smaller. Clearly, in the context of Eulerian gas dynamics, this requires 
the ability to run simulations with multiple levels of refinement that follow the formation of 
relevant objects. To resolve L* galaxies with ~100 mass elements (particles) per galaxy in 
simulations of this scale, one needs tens of billions of particles. Such systems, with crossing 
times of a million years or less, must be evolved for ~10 billion years. On the other side 
of the spectrum are simulations aimed at predicting Ly-a forest observations, where the 
dynamical range is more like ~10 4 , but as most of the contribution from the forest comes 
from gas at close to mean density, most of the computational volume must be resolved. These 
simulations require much "shallower" mesh refinement strategies covering a larger fraction 
of the domain. 

In addition to the challenge of achieving sufficient refinement, simulations need to incor- 
porate a range of different physics beyond the obvious self-gravity. The presence of baryons 
introduces corrections to gravity-only predictions; these corrections are negligible at very 
large scales, but become increasingly significant at smaller scales, requiring an accurate 
treatment of the gas in order to provide realistic results that can be directly compared to ob- 
servations. Radiative cooling and heating mechanisms are as influential on many observables 
as the addition of the gas itself, but large uncertainties remain in these terms, and our lack 
of knowledge increases as we move into highly non -linear systems. Thus, we can model low 



density Lyman-a absorbers at higher redshift (e.g. IViel et al.l (120051 )). we can r eproduce the 
bulk p roperties of the biggest objects in the Universe - galaxy clusters (see, e.g. IStanek et al. 



(120 10l )). but on smaller and muc h denser scale s, galaxy formation still remains elusive, and 



is a field of active research (e.g., iBensonl (120101 ) ) . 



In this paper, we present a newly developed N-body and gas dynamics code called 
Nyx. The code models dark matter as a system of Lagrangian fluid elements, or "particles," 
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gravitationally coupled to an inviscid ideal fluid representing baryonic matter. The fluid is 
modeled using a finite volume representation in an Eulerian framework with block-structured 
AMR. The mesh structure used to evolve fluid quantities is also used to evolve the particles 
via a particle-mesh method. In order to more accurately treat hypersonic motions, where 
the kinetic energy is many orders of magnitude larger than the internal energy of the gas, 
we employ the dual energy formulation, where both the internal and total energy equations 
are solved on the grid during each time step. 

There are a number of exis ting finite volume, AMR codes; most commonly used in the 

cosm ology community are Enzo (IBrvan et al. Il995h . ART (Kravtsov et al. 1997), FLASH f lFrvxell et al" 
2000), and RAMSES JTevssierl liooi ) . Enzo, FLASH and Nyx are all similar in that they 
use block-structured AMR; RAMSES and ART, by contrast, refine locally on individual 
cells, resulting in very different refinement patterns and data structures. Among the block- 
structured AMR codes, Enzo and FLASH enforce a strict parent-child relationship between 
patches; i.e., each refined patch is fully contained within a single parent patch. FLASH also 
enforces that all grid patches have the same size. Nyx requires only that the union of fine 
patches be contained within the union of coarser patches with suitable proper nesting. Again, 
this results in different refinement patterns, data structures, communication overhead, and 
scaling behavior. 

Time-refinement strategies vary as well. FLASH advances the solution at every level 
with the same time step, At, while Enzo, ART and RAMSES allow for subcycling in time. 
A standard subcycling strategy enforces that At/ 'Ax is constant across levels; ART and 
Enzo allow greater refinement in time than space when appropriate. Nyx maintains several 
different subcycling strategies. The user can specify at run-time that Nyx use no subcycling, 
standard subcycling, a specified subcycling pattern or "optimal subcycling." In the case 
of optimal subcycling, Nyx chooses, at specified coarse time intervals during a run, the 
most efficient subcycling pattern across all level s based on an automat ed assessment of the 
computational cost of each option at that time (IVan Andel et al.ll2013l ). 



One of the major goals behind the development of a new simulation code, in an already 
mature field such as computational cosmology, is to ensure effective utilization of existing 
and future hardware. Future large-scale cosmological simulations will require codes capa- 
ble of efficient scaling to hundreds of thousands of cores, and effective utilization of the 
multicore, shared memory, nature o f individual nodes. One such code for gravity-only simu- 
lations is HACC ( lHabib et al.ll2009l ). The frameworks for existing hydro codes such as Enzo 
and FLASH are currently being substantially re- written fo r current and future architectures. 



Nyx, like the CASTRO code for radiation-hydrodynamics (lAlmgren et al 



201ll ). and the MAESTRO code for low Mach number astrophysics (lAlmgren et al.ll2006al Jb; 



2010 



Zhang et al. 
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Almgren et al.ll2008l : IZingale et allbooa iNonaka et al.ll201Q[ ). is built on the BoxLib (IBoxLib 



20 111 ) software framework for block-structured adaptive mesh methods, and as such it lever- 
ages extensive efforts to achieve the massively parallel performance that future simulations 
will demand. BoxLib currently supports a hierarchical programming approach for multicore 
architectures based on both MPI and OpenMP. Individual routines, such as those to evaluate 
heating and cooling source terms, are written in a modular fashion that makes them easily 
portable to GPUs. 

In the next section we present the equations that are solved in Nyx. In section 3, 
we present an overview of the code, and in section 4 we discuss structured grid AMR and 
the multilevel algorithm. Section 5 describes in more detail the parallel implementation, 
efficiency and scaling results. Section 6 contains two different cosmological tests; the first is 
a pure A^-body simulation, and the second is the Santa Barbara cluster simulation, which 
incorporates both gas dynamics and dark matter. In the final section we discuss future 
algorithmic developments and upcoming simulations using Nyx. 



2. Basic Equations 

2.1. Expanding Universe 

In cosmological simulations, baryonic matter is evolved by solving the equations of 
self-gravitating gas dynamics in a coordinate system that is comoving with the expanding 
universe. This introduces the scale factor, a, into the standard hyperbolic conservation laws 
for gas dynamics, where a is related to the redshift, z, by a — 1/(1 + z), and can also 
serve as a time variable. The evolution of a in Nyx is described by the Friedmann equation 
that, for the two-component model we consider, with Hubble constant, H , and cosmological 
constant, Q\, has the form 

-lna = - = iW-f + , 1 

at a V a 6 

where f2 is the total matter content of the Universe today. 



2.2. Gas Dynamics 



In the comoving frame, we relate the comoving baryonic density, p^, to the proper 
density, p pr0 per, by pb = a 3 p proper , and define U to be the peculiar proper baryonic velocity. 
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The continuity equation is then written as 

| = -V-W). (2) 
at a 

Keeping the equations in conservation form as much as possible, we can write the momentum 
equation as 

d(ap b U) 



dt 



-V-( Pb UU)-Vp + Pb g, (3) 



where the pressure, p, is related to the proper pressure, p prope r, by p = a 3 p proper . Here 
g = — V0 is the gravitational acceleration vector. 



We use a dual energy formulation similar to that used in Enzo ( IBryan et al.lll995l ). in 
which we evolve the internal energy as well as the total energy during each time step. The 
evolution equations for internal energy, e, and total energy, E = e + \U 2 , can be written, 

d(a 2 p b e) 



dt 

d(a 2 p b E) 



-aV ■ (p b Ue) - apV -U + aa((2- 3(7 - l))p b e) + aA HC , (4) 
-aV • (p b UE + pU) + ap b U ■ g + ad ((2 - 3(7 - l))p b e) + ak HC (5) 



dt 

where Ahc represents the combined heating and cooling terms. We synchronize E and e at 
the end of each time step; the procedure depends on the magnitude of e relative to E, and 
is described in the next section. 

We consider the baryonic matter to satisfy a gamma-law equation of state (EOS), where 
we compute the pressure, p = (7 — l)pe, with 7 = 5/3. The composition is assumed to be a 
mixture of hydrogen and helium in their primordial abundances with the different ionization 
states for each element computed by assuming statistical equilibrium. The details of the 
EOS and how it is related to the heating and cooling terms will be discussed in future work. 
Given 7 = 5/3, we can simplify the energy equations to the form 

d(a 2 p b E) 



dt 
d{a 2 p b e) 
dt 



-oV ■ (p b UE + pU) + a(p b U ■ g + A^c) (6) 
-aV- (p b Ue) -apV-U + aA HC . (7) 



2.3. Dark Matter 

Matter content in the Universe is strongly (80-90%) dominated by cold dark matter, 
which can be modeled as a non-relativistic, pressureless fluid. The evolution of the phase- 
space function, /, of the dark matter is thus given by the collisionless Boltzmann (aka Vlasov) 
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equation, which in expanding space is: 

% + • W - mV <^ • 7T- = > ( 8 ) 

at ma 2 op 

where m and p are mass and momentum, respectively, and <fi is the gravitational potential. 

Rather than trying to solve Vlasov equation directly, we use Monte-Carlo sampling of 
the phase-space distribution at some initial time to define a particle representation, and then 
evolve the particles as an N-body system. Since particle orbits are integrals of a Hamiltonian, 
Liouville's theorem ensures that at some later time particles are still sampling the phase space 
distribution as given in equation (jSJ). Thus, in our simulations we represent dark matter as 
discrete particles with particle i having co moving location, x i; and peculiar proper velocity, 
Uj, and solve the system 

rfx, 1 

^ = g, (10) 
where gj is gravitational acceleration evaluated at the location of particle i, i.e., gj = g(xj, t). 



2.4. Self-gravity 

The dark matter and baryons both contribute to the gravitational field and are accel- 
erated by it. Given the comoving dark matter density, p dm , we define by solving 

V 2 0(x,t) = ^^-(pb + Pdm ~ Po) , (11) 

a 

where G is the gravitational constant and po is the average value of {pdm + Pb) over the 
domain. 



3. Code overview 

3.1. Expanding Universe 

Before evolving the gas and dark matter forward in time, we compute the evolution of 
a over the next full coarse level time step. Given a n defined as a at time t n , we calculate 
a n+1 via m second-order Runge-Kutta steps. The value of m is chosen such that the relative 
difference between a n+1 computed with m and 2m steps is less than 10 ~ 8 . This accuracy 
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is typically achieved with m = 8 — 32, and the computational cost of advancing a is trivial 
relative to other parts of the simulation. We note that At is constrained by the growth 
rate of a such that a changes by no more than 1% in each time step at the coarsest level. 
Once a n+1 is computed, values of a needed at intermediate times are computed by linear 
interpolation between a n and a n+1 . 



3.2. Gas Dynamics 



We describe the state of the gas as U = (p&, apbU, a 2 pbE, a 2 p b e), then write the evolution 
of the gas as 

mi , 

12 



_ = -V • F + S e + S g + S HC , 

where F = (1/a pbU, pbUU,a(pbUE + pU),apbUe) is the flux vector, S e = (0,0,0,— apV ■ 
U) represents the additional term in the evolution equation for internal energy, S g = 
(0, pbg, apbU ■ g, 0) represents the gravitational source terms, and Shc — (0, 0, aA HC , aA HC ) 
represents the combined heating and cooling source terms. The state, U, and all source 
terms are defined at cell centers; the fluxes are defined on cell faces. 

We compute F using an unsplit Godunoy meth od with charact e ristic tracing and full 
corner coupling (IColellal Il990t ISaltzmanl Il994l ) . See lAlmgren et al.l (120101 ) for a complete 
description of the algorithm without the scale factor, a; here we include a in all terms as 
appropriate in a manner that preserves second-ord er temporal accuracy. The hy drodynamic 
integration supports both unsplit pie cewise linear (jColellajll99Qt ISaltzmanlll994J ) and unsplit 
piecewise parabolic (PPM) schemes ( Miller fc ColellaT 20021 ) to construct the time-centered 
edge states used to define the fluxes. The piecewise linear v ersion of the implementatio n 
includes the op t ion to include a reference state, as discussed in lColella fc Woodward! ( 119841 ); 
Colella &; Glazl ( 1l985l ) . We choose as a reference state the average of the reconstructed profile 
over the domain of dependence of the fastest characteristic in the cell propagating toward the 
interface. (The value of the reconstructed profile at the edge is used if the fastest character- 
istic propagates away from the edge.) This choice of reference state minimizes the degree to 
which the algorithm relies on the linearized equations. In particular, this choice eliminates 
one component of the characteristic extrapolation to edges used in second-order Godunov 
algorithms. For the hypersonic flows typical of cosmological simulations, we have found the 
use of reference states improves the overall robustness of the algorithm. The Riemann solver 
in Nyx iteratively solves the Riemann problem using a two-shock approximation as described 



in 



Colella fc Glazl (Il985f ); this has been found to be more robust for hyperson ic cosmological 



flows than the simpler linearized approximate Riemann solver described in lAlmgren et al. 



( 120101 ) . In cells where e is less than 0.01% of E, we use e as evolved independently to compute 
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temperature and pressure from the EOS and re-define E as e + \U 2 ; otherwise we re-define 
e as E - \U 2 . 

The gravitational source terms, S g , in the momentum and energy equations are dis- 
cretized in time using a predictor-corrector approach. Two alternatives are available for the 
discretization of pbU ■ g in the total energy equation. The most obvious discretization is to 
compute the product of the cell-centered momentum, (pbU), and cell-centered gravitational 
vector, g, at each time. While this is spatially and temporally second-order accurate, it can 
have the un-physical consequence of changing the internal energy, e = E — \U 2 , since the 
update to E is analytically but not numerically equivalent to the update to the kinetic energy 
calculated using the updates to the momenta. A second alternative defines the update to E 
as the update to \U 2 . A more co mplete rev i ew of the design choices for including self-gravity 



in the Euler equations is given in ISpringell ( 120101 ). 



3.3. Dark Matter 



We evolve the positions and velocities of the dark matter particles using a standard kick- 
drift-kick sequence, identical to that described in, for example, iMiniati fc Colellal (120071 ). We 
compute gj at the i'th particle's location, Xj by linearly interpolating g from cell centers to 
Xj. To move the particles, we first accelerate the particle by At/2, then advance the location 
using this time-centered velocity: 

(13) 



au, 



[aui 



r n+l 



x? + At- 



2 
1 



-u 



< 14 > 

After gravity has been computed at time t n+1 , we complete the update of the particle veloc- 
ities, 

At 

T ! 

We observe th at this scheme is symplectic, thus conserves the integral of motion on average 
(jYoshidalll993l ). Computationally the scheme is also appealing because no additional storage 
is required during the time step to hold intermediate positions or velocities. 



au, 



\n+l 



au, 



r n+l 



(15) 



3.4. Self-gravity 



We define pd m on the mesh using the cloud-in-cell scheme, whereby we assume the 
mass of the i'th particle is uniformly distributed over a cube of side Ax centered at Xj. We 
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assign the mass of each particle to the cells on the grid in proportion to the volume of the 
intersection of each cell with the particle's cube, and divide these cell values by Ax 3 to define 
the comoving density that contributes to the right hand side of Eq. (jlip . 

We solve for on the same cell centers where p& and pd m are defined. The Laplace 
operator is discretized using the standard 7-point finite difference stencil and the resulting 
linear system is solved using geometric multigrid techniques, specifically V-cycles with red- 
black Gauss-Seidel relaxation. The tolerance of the solver is typically set to 10~ 12 , although 
numerical exploration indicates that this tolerance can be relaxed. While the multigrid 
solvers in BoxLib support Dirichlet, Neumann and periodic boundary conditions, for the 
cosmological applications presented here we always assume periodic boundaries. Further 
gains in efficiency result from using the solution from the previous Poisson solve as an initial 
guess for the new solve. 

Given at cell centers, we compute the average value of g over the cell as the cen- 
tered difference of between adjacent cell centers. This discretization, in combination 
with the interpolation of g from cell centers to particle locations described above, ensures 
that on a uniform mesh th e gravitational force of a particle on itself is identically zero 



flHocknev fc Eastwoodlll981l ). 



3.5. Computing the time step 

The time step is computed using the standard CFL condition for explicit methods, with 
additional constraints from the dark matter particles and from the evolution of a. Using the 
user-specified hyperbolic CFL factor, < <7 CFL,hyp < 1 ; and the sound speed, c, computed 
by the equation of state, we define 

At hyp = a CFL > h ^ min ( °f * } , (16) 

yp »=i...3 Lmax x \U ■ ei\ + c J 

where ej is the unit vector in the i direction and max x is the maximum taken over all 
computational grid cells in the domain. 

The time step constraint due to particles requires that the particles not move farther 
than Ax in a single time step. Since u" is not yet known when we compute the time 
step, At at time t n , for the purposes of computing At we estimate u™ + ^ 2 by u™. Using the 
user-specified CFL number for dark matter, < <j CFL > dm < l j that is independent of the 
hyperbolic CFL number, we define 

At dm = a CFL > dm min ( ), (17) 

t=i.. .3 maxj ■ ej| J 
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where max.,- is the maximum taken over all particles j. The default value for <j CFL > dm = 0.5. 
In cases where the particle velocities are initially small but the gravitational acceleration is 
large, we also constrain At^ m by requiring that gjAt^ m < Ax for all particles to prevent 
rapid acceleration resulting in a particle moving too far. 

Finally, we compute At a as the time step over which a would change by 1%. The time 
step taken by the code is set by the smallest of the three: 

At = min (AW, &thyp, At a ) • (18) 

We note that non-zero source terms can also potentially constrain the time step; we defer 
the details of that discussion to a future paper. 



3.6. Single-Level Integration Algorithm 

The algorithm at a single level of refinement begins by computing the time step, At, 
and advancing a from t n to t n+1 = t n + At. The rest of the time step is then composed of 
the following steps: 

Step 1: Compute <p n and g n using p^ and p^ m , where p^ m is computed from the particles at 

■*t • 

We note that in the single-level algorithm we can instead use g as computed at the 
end of the previous step because there have been no changes to Xj or p since then. 

Step 2: Advance U by At. 

If we are using a predictor-corrector approach for the heating and cooling source terms, 
then we include all source terms in the explicit update: 

U n+1 '* = U n - AtV • F n+1/a + AtS e n+1/2 + AtS n g + AtSl c . (19) 

where F n+1 / 2 and ST k are computed by predicting from the U n states. 

If we are instead using Strang splitting for the heating and cooling source terms, we 
first advance (pe) and pE by integrating the source terms in time for |At 

(pep* = (pe) n + J A HC dt' , (20) 
(pE) n '* = ( P E) n + J A HC dt' , (21) 



- 11 - 



then advance the solution using time-centered fluxes and S e and an explicit represen- 
tation of S g at time t n : 

U n+1 >* = U n '* - AtV • F n+1/2 + AtS r e l+1/2 + AtSg 1 (22) 

where F n+1 / 2 and ^ +/2 are computed by predicting from the U n '* states. The details of 
how the heating and cooling source terms are incorporated depend on the specifics of 
the heating and cooling mechanisms represented by Ahc] we defer the details of that 
discussion to a future paper. 

Step 3: Interpolate g n from the grid to the particle locations, then advance the particle 
velocities by At/2 and particle positions by At. 

u," + ' /2 = ^((«X) + y g?) (23) 
x ? +1 - < + ^"T* < 24 > 

Step 4: Compute </> n+1 and g n+1 using p£ + '* and p^ 1 , where p^t 1 is computed from the 
particles at x™ +1 . 

Here we can use <fi n as an initial guess for ip n+1 in order to reduce the time spent in 
multigrid to reach the specified tolerance. 

Step 5: Interpolate g n+1 from the grid to the particle locations, then update the particle 
velocities, u" +1 



< +1 = -ir ( + ^ g? +1 ) (25) 



At 

a n+1 \V ~ l ) ' T 

Step 6: Correct U u>zt/i time-centered source terms, and replace e by E— |[/ 2 as appropriate. 

If we are using a predictor-corrector approach for the heating and cooling source terms, 
we correct the solution by effectively time-centering all the source terms: 



At At 

T 



U n+1 = U" +1 <* + — - fl?) + — (^'* - S n HC ) . (26) 



If we are using Strang splitting for the heating and cooling source terms, we time-center 
the gravitational source terms only, 

-jjn+l,** -jjn+1,* _|_ ^ ( CJ n ~\~^->* 5* n ) (27) 

2 9 9 
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then integrate the source terms for another |At, 



(pe) n+1 = (pe) n+1 >** + J A HC dt' 
(pE) n+1 = ( P E) n+1 '** + J A HC dt' 



(28) 
(29) 



We note here that the time disc retization of the gravitational source terms differs from 
that in Enzo ( Bryan et al. 1995 ). where Sg is computed by extrapolation from values 
at t n and t n ~ l . 

If, at the end of the time step, (pE) n+1 - \p n+1 {U n+l ) 2 > lQ- 4 (pE) n+1 , we re-define 



(Pe) 



n + l 



(pE) n+1 — ^p n+1 (U n+1 ) 2 ; otherwise we use e n+1 as evolved above when needed 



to compute pressure or temperature, and re-define (pE) n+1 = (pe) n+1 + \p n+1 {U n+1 ) 2 . 



This concludes the single-level algorithm description. 



4. AMR 



Block-structured AMR was introduced by iBerger fc Oligerl (119841); i t has been demon- 
strated to be highly succe ssful for gas dynamics by lBerger fc Colellal (|1989l ) in two dimensions 
and by iBell et al.l (119941 ) in three dimensions, and has a long history of successful use in a 
variety of fields. The AMR methodology in Nyx uses a nested hierarchy of rectangular grids 
with refinement of the grids in space by a factor of two between levels, and refinement in 
time between levels as dictated by the specified subcycling algorithm. 



4.1. Mesh Hierarchy 

The grid hierarchy in Nyx is composed of different levels of refinement ranging from 
coarsest [I = 0) to finest (£ = finest)- The maximum number of levels of refinement allowed, 
^max; is specified at the start (or any restart) of a calculation. At any given time in the 
calculation there may be fewer than £ max levels in the hierarchy, i.e. ^finest can change dy- 
namically as the calculation proceeds as long as £fi nes t < -dax- Each level is represented by 
the union of non-overlapping rectangular grids of a given resolution. Each grid is composed 
of an even number of "valid" cells in each coordinate direction; cells are the same size in 
each coordinate direction but grids may have different numbers of cells in each direction. 
Each grid also has "ghost cells" which extend outside the grid by the same number of cells 
in each coordinate direction on both the low and high ends of each grid. These cells are 
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used to temporarily hold data used to update values in the "valid" cells; when subcycling in 
time the ghost cell region must be large enough that a particle at level £ can not leave the 
union of level £ valid and ghost cells over the course of a level £ — 1 timestep. The grids are 
properly nested, in the sense that the union of grids at level £ + 1 is contained in the union of 
grids at level £. Furthermore, the containment is strict in the sense that the level £ grids are 
large enough to guarantee a border at least n pr0 per level £ cells wide surrounding each level 
£ + 1 grid, where n proper is specified by the user. There is no strict parent-child relationship; 
a single level £ + 1 grid can overlay portions of multiple level £ grids. 



We initialize the grid hierarchy and regrid following the procedure outlined in lBell et al. 



( 119941 ) . To define grids at level £ + 1 we first tag cells at level £ where user-specified criteria 
are met. Typical tagging criteria, such as local overdensity, can be selected at run-time; 
specialized refinement criteria, which can be based on any variable or combination of variables 
that can be constructed from the fluid state or particle information, can also be used. The 
ta gged cells are grouped into rectangular grids at level £ using the clustering algorithm given 



m 



Berger Sz Rigoutsosl ( 119911 ). These rectangular patches are then refined to form the grids 
at level I + 1. Large patches are broken into smaller patches for distribution to multiple 
processors based on a user-specified max_gridsize parameter. 

A particle, i, is said to "live" at level £ if level £ is the finest level for which the union of 
grids at that level contains the particle's location, x^. The particle is said to live in the n th 
grid at level £ if the particle lives at level £ and the n th grid at level £ contains Xj. Each particle 
stores, along with its location, mass and velocity, the current information about where it 
lives, specifically the level, grid number, and cell, (i,j,k). Whenever particles move, this 
information is recomputed for each particle in a "redistribution" procedure. The size of a 
particle in the PM scheme is set to be the mesh spacing of the level at which the particle 
currently lives. 

At the beginning of every kt level £ time steps, where k^ > 1 is specified by the user 
at run-time, new grid patches are defined at all levels £ + 1 and higher if £ < £ max - In 
regions previously covered by fine grids the data is simply copied from old grids to new; 
in regions that are newly refined, data is interpolated from underlying coarser grids. The 
interpolation procedure constructs piecewise linear profiles within each coarse cell based 
on nearest neighbors; these profiles are limited so as to not introduce any new maxima or 
minima, then the fine grid values are defined to be the value of the trilinear profile at the 
center of the fine grid cell, thus ensuring conservation of all interpolated quantities. All 
particles at levels £ through £ max are redistributed whenever the grid hierarchy is changed 
at levels £ + 1 and higher. 
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4.2. Subcycling Options 

Nyx supports four different options for how to relate Ate, the time step at level £ > 0, 
to Ato, the time step at the coarsest level. The first is a non-sub cycling option; in this case 
all levels are advanced with the same At, i.e. Ati = At Q for all £. The second is a "standard" 
subcycling option, in which At/ Ax is constant across all levels, i.e. Atg = r e At for spatial 
refinement ratio, r, between levels. These are both standard options in multilevel algorithms. 
In the third case, the user specifies the subcycling pattern at run-time; for example, the 
user could specify subcycling between levels and 1, but no subcycling between levels 1, 
2 and 3. Then At 3 = At 2 = Ati — r AtQ. The final option is "optimal subcycling," in 
which the optimal subcycling algorithm decides at each coarse time step, or other specified 
interval, what the subcycling pattern should be. For example, if the time step at each level 
is constrained by At a , the limit that enforces the condition that a not change more than 
1% in a time step, then not subcycling is the most efficient way to advance the solution on 
the multilevel hierarchy. If dark matter particles at level 2 are moving a factor of two faster 
than particles at level 3 and Atd m is the limiting constraint in choosing the time step at 
those levels, it may be most efficient to advance levels 2 and 3 with the same At, but to 
subcycle level 1 relative to level and level 2 relative to level 1. These choices are made by 
the optimal subcycling algorithm during the run; th e algor ithm used to compute the optimal 



subcycling pattern is described in I Van Andel et all ( 120131 ). 



In addition to the above examples, additional considerations, such as the ability to 
parallelize over levels rather than just over grids at a particular level, may arise in large- 
scale parallel computations, which can make more complex subcycling patterns more efficient 
than the "standard" pattern. We defer further discussion of these cases to future work in 
which we demonstrate the trade-offs in computational efficiency for specific cosmological 
examples. 



4.3. Multilevel Algorithm 

We define a complete multilevel "advance" as the combination of operations required to 
advance the solution at level by one level time step and operations required to advance 
the solution at all finer levels, < £ < £ finest, to the same time. We define ri£ as the number 
of time steps taken at level £ for each time step taken at the coarsest level, i.e. At = n^Ati 
for all levels, < £ < £ finest- If the user specifies no subcycling, then ri£ = 1; if the user 
specifies standard subcycling then ri£ = r e . If the user defines a subcycling pattern, this is 
used to compute rig at the start of the computation; if optimal subcycling is used then ri£ is 
re-computed at specified intervals. 
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The multilevel advance begins by computing Ate for all £ > 0, which first requires 
the computation of the maximum possible time step At maXj ^ at each level independently 
(including only the particles at that level). Given a subcycling pattern and At max £ at each 
level, the time steps are computed as: 

At = min (n e ■ At max/ ) (30) 

0<£<£ finest 

At , \ 

At, = — S 31 

Once we have computed At, for all £ and advanced a from time t to t + Ato , a complete 
multilevel "advance" is defined by calling the recursive function, Advance(^), described 
below, with £ = 0. We note that in the case of no-sub cycling, the entire multilevel advance 
occurs with a single call to Advance(O); in the case of standard subcycling Advance(^) 
will be called recursively for each £, < £ < £ finest- Intermediate subcycling patterns require 
calls to Advance(^) for every level £ that subcycles relative to level £ — 1. In the notation 
below, the superscript n refers to the time, t n , at which each call to Advance begins at level 
£, and the superscript n + 1 refers to time t n + At,. 

Advance (£): 

Step 0: Define £ a as the finest level such that Ate a = Ate- 

If we subcycle between levels £ and £ + 1 then £ a = £ and this call to Advance only 
advances level £. If, however, we don't subcycle we can use a more efficient multi-level 
advance on levels £ through £ a . 

Step 1: Compute <p n and g n at levels £ — > £ finest using p% and p^ m , where p^ m is computed 
from the particles at x". 

This calculation involves solving the Poisson equation on the multilevel grid hierarchy 
using all finer levels; further detail about computing pd m and solving on the multilevel 
hierarchy is given in the next section. If has already been computed at this time 
during a multilevel solve in Step 1 at a coarser level, this step is skipped. 

Step 2: Advance U by Ate at levels £ — > £ a . 

Because we use an explicit method to advance the gas dynamics, each level is advanced 
independently by Ate, using Dirichlet boundary conditions from a coarser level as 
needed for boundary conditions. If £ a > £, then after levels £ — > £ a have been advanced. 



we pe rform an explicit reflux of the hydrodynamic quantities (see, e.g.. lAlmgren et al. 



( 120101 ) for more details on refluxing) from level £ a down to level £ to ensure conservation. 
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Step 3: Interpolate g n at levels £ — > £ a from the grid to the particle locations, then advance 
the particle velocities at levels I — >■ £ a by Ate/2 and particle positions by Ate- The 
operations in this step are identical to those in the single-level version, and require no 
communication between grids or between levels. Although a particle may move from 
a valid cell into a ghost cell, or from a ghost cell into another ghost cell, during this 
step, it remains a level £ particle until redistribution occurs after the completion of the 
level £ — 1 timestep. 

Step 4: Compute n+1 and g n+1 at levels £ — > £ a using p^ +1 '* and p^\, where p^ 1 is 
computed from the particles at x" +1 . 

Unlike the solve in Step 1, this solve over levels £ — > £ a is only a partial multilevel 
solve unless this is a no-subcycling algorithm; further detail about the multilevel solve 
is given in the next section. 

Step 5: Interpolate g n+1 at levels £ — > £ a from the grid to the particle locations, then update 
the particle velocities at levels £ — > £ a by an additional Ate/2. Again this step is 
identical to the single-level algorithm. 

Step 6: Correct U at levels £ — > £ a with time-centered source terms, and replace e by E—^U 2 
as appropriate. 

Step 7: If £ a < £f inest , Advance^ + 1) n ia+ i/n e times. 

Step 8: Perform any final synchronizations 

• If £ a < £ finest, reflux the hydrodynamic quantities from level £ a + 1 to level £ a . 

• If £ a < £ finest, average down the solution from level £ a + 1 to level £ a . 

• If £ < £ a , average down the solution from level £ a through level £. 

• If £ a < £ finest, perform an elliptic synchronization solve to compute 5(j) at levels 
£ through £fi nes t, where the change in (j) results from the change in p at level £ a 
due to refluxing from level £ a + 1, and from the mismatch in Neumann boundary 
conditions at the £ a / £ a + 1 interface (that results because the finest level in the 
solve from Step 4 was £ a , not £fi nes t)- Add 5<p to <f> a t levels £ through £ fi nest- 



This synchronization solve is described in more detail in lAlmgren et al.l (120101 ) for 
the case of standard subcycling. 

Redistribute particles at levels £ —> £fi nes t] Note that, if £ > 0, particles that 
started the time step at level £ but now would live at level £ — 1 are kept for now 
in the ghost cells at level l\ they are not redistributed to level £ — 1 until the level 
£ — 1 timestep is complete. 
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4.4. Gravity Solves 

In the algorithm as described above, there are times when the Poisson equation for the 
gravitational potential, 

V 2 0= (pb + pdm-po), (32) 

a 

is solved simultaneously on multiple levels, and times when the solution is only required on 
a single level. We define L e as the standard 7-point finite difference approxi mation to V 2 at 



l evel I , modified as appropriate at the £/(£ — 1) interface if £ > 0; see, e.g., lAlmgren et al. 



( 119981 ) for details of the multilevel cell-centered interface stencil. 
When £ = £ fi nes t in Step 1, or £ — £ a in Step 4, we solve 

, o , o Al'kG , o o \ 
-{Pb + Pdm-P0) 



a 

on the union of grids at level £ only; this is referred to as a level solve. 

If £ = then the grids at this level cover the entire domain and the only boundary 
conditions required are periodic boundary conditions at the domain boundaries. If £ > 
then Dirichlet boundary conditions for the level solve at level £ are supplied at the £/ (£ — 1) 
interface from data at level £ — 1. Values for <j) at level £ — 1 are assumed to live at the cell 
centers of level £—1 cells; these values are interpolated tangentially at the £/ (£—1) interface to 
supply boundary values with spacing Ax for the level £ grids. The modified stencil obviates 
the need for interpolation normal to the interface. The resulting linear system is solved using 
geometric multigrid techniques, specifically V-cycles with red-black Gauss-Seidel relaxation. 
We note that after a level £ solve with £ > 0, <fi at levels £ and £ — 1 satisfy Dirichlet but not 
Neumann matching conditions. This mismatch is corrected in the elliptic synchronization 
described in Step 8. 

When a multilevel solve is required, we define L c £ °™ p as the composite grid approximation 
to V 2 on levels £ through m, and define a composite solve as the process of solving 

i™ p r p = —(p b omp + PdT - po) 

(X 

on levels £ through m. The solution to the composite solve satisfies 

Lm0 co m p = 47TG 

a 



at level m, but satisfies 



- t' if 4:7tG , fl fl . 

' <f> = —(Pb +P dm -po) 
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for I < £' < m only on the regions of each level not covered by finer grids or adjacent to the 
boundary of the finer grid region. In regions of a level £' grid covered by level £' + 1 grids £ ' is 
defined as the volume average of 4> e>+1 ; in level £' cells immediately adjacent t o the boundary 



of the union of level £' + 1 grids, a modified interface operator as described in lAlmgren et al. 



( 119981 ) is used. This linear system is also solved using geometric multigrid with V-cycles; 
here levels £ through m serve as the finest levels in the V-cycle. After a composite solve 
on levels £ through m, satisfies both Dirichlet and Neumann matching conditions at the 
interfaces between grids at levels £ through m; however, there is still a mismatch in Neumann 
matching conditions between at levels £ and £ — 1 if £ > 0. This mismatch is corrected in 
the elliptic synchronization described in Step 8. 



4.5. Cloud-in-Cell Scheme 

Nyx uses the cloud-in-cell scheme to define the contribution of each dark matter parti- 
cle's mass to pd m ; a particle contributes to pd m only on the cells that intersect the cube of side 
Ax centered on the particle. In the single-level algorithm, all particles live on a single level 
and contribute to pdm at that level. If there are multiple grids at a level, and a particle lives 
in a cell adjacent to a boundary with another grid, some of the particle's contribution will 
be to the grid in which it lives, and the rest will be to the adjacent grid(s). The contribution 
of the particle to cells in adjacent grid(s) is computed in two stages: the contribution is first 
added to a ghost cell of the grid in which the particle lives, then the value in the ghost cell is 
added to the value in the valid cell of the adjacent grid. The right-hand-side of the Poisson 
solve sees only the values of pd m on the valid cells of each grid. 

In the multilevel algorithm, the cloud-in-cell procedure becomes more complicated. Dur- 
ing a level £ level solve or a multilevel solve at levels £ — > m, a particle at level £ that lives 
near the £/(£ — 1) interface may "lose" part of its mass across the interface. In addition, 
during a level solve at level £, or a multi-level solve at levels £ — > m, it is essential to include 
the contribution from dark matter particle at other levels even if these particles do not lie 
near a coarse-fine boundary. Both of these complications are addressed by the creation and 
use of ghost and virtual particles. 

The mass from most particles at levels £' < £ is accounted for in the Dirichlet boundary 
conditions for <p e that are supplied from <p e ~ l at the £/ (£ — 1) interface. However, level £ — 1 
particles that live near enough to the £/(£—!) interface to deposit part of their mass at level 
£, or have actually moved from level £ — 1 to level £ during a level £ — 1 timestep, are not 
correctly accounted for in the boundary conditions. The effect of these is included via ghost 
particles, which are copies at level £ of level £ — 1 particles. Although these ghost particles 
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live and move at level £, they retain the size, Ax £ 1 of the particles of which they are copies. 

The mass from particles living at levels £' > m in a level m level solve or a level £ — > m 
multilevel solve, is included at level m via the creation of virtual particles. These are level 
m representations of particles that live at levels £' > m. Within Nyx there is the option to 
have one-to-one correspondence of fine level particles to level m virtual particles; the option 
also exists to aggregate these fine level particles into fewer, more massive virtual particles 
at level m. The total mass of the finer level particles is conserved in either representation. 
These virtual particles are created at level m at the beginning of a level m timestep and 
moved along with the real level m particles; this is necessary because the level £' particles 
will not be moved until after the second Poisson solve with top level m has occurred. The 
mass of the virtual particles contributes to p^ m at level m following the same cloud-in-cell 
procedure as followed by the level m particles, except that the virtual particles have size 
Ax m+1 (even if they are copies of particles at level £' with £' > m + 1). 



5. Software Design and Parallel Performance 
5.1. Overview 



Nyx is implemented within the BoxLib (iBoxLibl |2011| ) framework, a hybrid C++ / 
Fortran90 software system that provides support for the development of parallel block- 
structured AMR applications. The basic parallelization strategy uses a hierarchical pro- 
gramming approach for multicore architectures based on both MPI and OpenMP. In the 
pure-MPI instantiation, at least one grid at each level is distributed to each core, and each 
core communicates with every other core using only MPI. In the hybrid approach, where on 
each socket there are n cores that all access the same memory, we can instead have fewer, 
larger grids per socket, with the work associated with those grids distributed among the n 
cores using OpenMP. 

In BoxLib, memory management, flow control, parallel communications and I/O are 
expressed in the C++ portions of the program. The numerically intensive portions of the 
computation, including the multigrid solvers, are handled in Fortran90. The software sup- 
ports two data distribution schemes for data at a level, as well as a dynamic switching scheme 
that decides which approach to use based on the number of grids at a level and the number 
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5.2. Particle and Particle-Mesh Operations 

BoxLib also contains support for parallel particle and particle-mesh operations. Particles 
are distributed to processors or nodes according to their location; the information associated 
with particle % at location Xj exists only on the node that owns the finest-level grid that 
contains Xj. Particle operations are performed with the help of particle iterators that loop 
over particles; there is one iterator associated with each MPI process, and each iterator loops 
over all particles that are associated with that process. 

Thus interactions between particles and grid data, such as the calculation of pd m , or 
the interpolation of g from the grid to Xj, require no communication between processors 
or nodes; they scale linearly with the number of particles, and exhibit almost perfect weak 
scaling. Similarly, the update of the particle velocities using the interpolated gravitational 
acceleration, and the update of the particle locations using the particle velocities, require 
no communication. The only inter-node communication of particle data (location, mass and 
velocity) occurs when a particle moves across coarse-fine or fine-fine grid boundaries; in this 
case the redistribution process re-computes the particle level, grid number, and cell, and 
the particle information is sent to the node or core where the particle now lives. We note, 
though, that when information from the particle is represented on the grid hierarchy, such 
as when the particle "deposits" its mass on the grid structure in the form of a density field 
used in the Poisson solve, that grid-based data may need to be transferred between nodes if 
the particle is near the boundary of the grid it resides in, and deposits part of its mass into 
a different grid at the same or a different level. 

Looking ahead, we also note here that the particle data structures in BoxLib are written 
in a sufficiently general form with C++ templates that they can easily be used to represent 
other types of particles (such as those associated star formation mechanisms) which carry 
different numbers of attributes beyond location, mass and velocity. 



5.3. Parallel I/O and Visualization 



As simulations grow increasingly large, the need to read and write large data-sets effi- 
ciently becomes increasingly critical. Data for checkpoints and analysis are written by Nyx 
in a self-describing format that consists of a directory for each time step written. Checkpoint 
directories contain all necessary data to restart the calculation from that time step. Plotfile 
directories contain data for postprocessing, visualization, and analytics, which can be read 
using amrvis, a cu stomized visualization pack age dev eloped at LBNL f or visualizing data on 
AMR grids, Visit (jvislt User's Manual! 120051 ) . or yt hurk et allboilh . Within each check- 
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point or plotfile directory is an ASCII header file and subdirectories for each AMR level. 
The header describes the AMR hierarchy, including the number of levels, the grid boxes at 
each level, the problem size, refinement ratio between levels, step time, etc. Within each 
level directory are multiple files for the data at that level. Checkpoint and plotfile directories 
are written at user-specified intervals. 

For output, a specific number of data files per level is specified at run time. Data 
files typically contain data from multiple processors, so each processor writes data from 
its associated grid(s) to one file, then another processor can write data from its associated 
grid(s) to that file. A designated I/O processor writes the header files and coordinates which 
processors are allowed to write to which files and when. The only communication between 
processors is for signaling when processors can start writing and for the exchange of header 
information. While I/O performance even during a single run can be erratic, recent timings 
of parallel I/O on the Hopper machine at NERSC, which has a theoretical peak of 35GB/s, 
indicate that Nyx's I/O performance, when run with a single level composed of multiple 
uniformly-sized grids, can reach over 34GB/s. 

Restarting a calculation can present some difficult issues for reading data efficiently. 
Since the number of files is generally not equal to the number of processors, input during 
restart is coordinated to efficiently read the data. Each data file is only opened by one 
processor at a time. The designated I/O processor creates a database for mapping files to 
processors, coordinates the read queues, and interleaves reading its own data. Each processor 
reads all data it needs from the file it currently has open. The code tries to maintain the 
number of input streams to be equal to the number of files at all times. 

Checkpoint and plotfiles are portable to machines with a different byte ordering and 
precision from the machine that wrote the files. Byte order and precision translations are 
done automatically, if required, when the data is read. 



5.4. Parallel Performance 

In Figure [1] we show the scaling behavior of the Nyx code using MPI and OpenMP on 
the Hopper machine at NERSC. A weak scaling study was performed, so that for each run 
there was exactly one 128 3 grid per NUMA node, where each NUMA node has 6 coresJl] 



1 We note that for small to moderate numbers of cores, pure-MPI runs can be significantly faster than 
runs using hybrid MPI and OpenMP. However, for the purposes of this scaling study we report only hybrid 
results. 
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Fig. 1. — Weak scaling behavior of Nyx on the NERSC Hopper (Cray XE 6) machine for 
simulations of the replicated Santa Barbara problem. 



The problem was initialized using replication of the Santa Barbara problem, described 
in the next section. First, the original 256 3 particles were read in, then these particles were 
replicated in each coordinate direction according to the size of the domain. For example, 
on a run with 1024 x 1024 x 2048 cells, the domain was replicated 4 times in the x- and y- 
directions and 8 times in the z-direction. With one 128 3 grid per NUMA node, this run would 
have used 1024 processors. The physical domain was scaled with the index space, so that 
the resolution (Ax) of the problem didn't change with weak scaling, thus the characteristics 
of the problem which might impact the number of multigrid V-cycles, for example, were 
unchanged as the problem got larger. 

The timings in Figure [1] are the time per time step spent in different parts of the 
algorithm for a uniform grid calculation. The "hydro" timing represented all time spent to 
advance the hydrodynamic state, excluding the calculation of the gravity. The computation 
of gravity is broken into two parts; the set-up and initialization of the multigrid solvers, and 
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the actual multigrid V-cycles themselves. The set-up phase includes the assignment of mass 
from the particles to the mesh using the cloud-in-cell scheme. In this run, each multigrid 
solve took 7 V-cycles to reach a tolerance of 10~ 12 . We note that, for this problem which 
has an average of one particle per cell, the contribution to the total time from moving and 
redistributing the particles is negligible. 

We see here that the Nyx code scales well from 48 to 49,152 processors, with a less than 
50% increase in total time with the 1000-fold increase in processors. The hydrodynamic core 
of CASTRO, which is essentially the same as that of Nyx, has show n almost fiat scaling 
to 21 IK processors on the Jaguarpf machine at OLCF (IBoxLibl 120111 ). and only a modest 
overhead from using AM R with subcycl i ng, r anging from roughly 5% for 8 processors to 
19% for 64K processors (lAlmgren et al.l |2010| ). MAESTRO, another BoxLib-based code 
which uses the same lin ear solyer fra mework as Nyx, has demonstrated excellent scaling to 
96K cores on Jaguarpf (IBoxLibl 120111 ). 



6. Validation 

The hydrodynamic integrator in Nyx was built by extending the integrator in CASTRO 
to the equations for an expanding universe; the Nyx hydrodyn amics have been valid ated 



using the same tests, and with the same results, as described in lAlmgren et al.l ( 120101 ). To 
isolate and study the behavior of the component of the algorithm responsible for updating 
the particle locations and velocities, we have performed a resolution study of a simple two- 
particle orbit. In a nondimensionalized cubic domain 16 units on a side, we initialized two 
particles of equal mass each at a distance of 1 unit in the x-direction from the center of 
the domain. In order to compare the numerical results to the exact circular orbit solution, 
the initial particle velocities were specified to be those corresponding to a perfectly circular 
orbit, and Dirichlet boundary conditions, constructed as the sum of two monopole expansions 
(one from each particle), were imposed at the domain boundaries. Six numerical tests were 
performed: three with a base grid of 64 3 , and either 0, 1 or 2 levels of refinement; one with 
a base grid of 128 3 and or 1 level of refinement, and finally a uniform 256 3 case. Three 
key observations followed. First, the orbits of the particles were stable over the ten orbits 
observed, with little variation in the measured properties over time. Second, the orbit radius 
and kinetic energy of each particle converged with mesh spacing; the maximum deviations 
of the orbit radius and the particle's kinetic energy from their initial values were no more 
than approximately 1.1% and 2.4%, respectively, for the coarsest case and 0.17% and 0.36%, 
respectively, for the finest cases. Third, the difference in the deviation of the orbit radius 
between the single-level and multilevel simulations with the same finest resolution was less 
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than .0002% of the radius; the difference in deviation of the kinetic energy was less than 
.0004% of the initial kinetic energy. 

We have also conducted several simple tests with known analytical solutions, e.g. the 
MacLaurin spheroid and Zel'dovich pancakes, to additionally validate the particle dynamics 
and the expected second-order accuracy and convergence properties of the Nyx gravity solver. 
Finally in this section, we present two validation stu dies in a cosmological context, using 
well-established data sets from the Cos mic Data ArXiv ([Heitmann et al.ll2005l ) and the Santa 
Barbara Cluster Comparison Project (IFrenk et al.lll999l ). 



6.1. Dark Matter Only Simulations 



To validate Nyx in a realistic cosmological setting, we first co mpare Nyx Kray i ty-onl y 
simulation results to one of the comparison data sets described in iHeitmann et al.l (120051 ). 
The domain is 256 Mpc/h on a side, and the cosmology considered is Q m = 0.314, = 
0.686, h = 0.71, as = 0. 84. Initial conditions are prov ided at z = 50 by applying the 



Klypin fc Holtzmanl (119971 ) transfer function and using the lZerdovichl (Il970l ) approximation. 
The simulation evolves 2 56 3 dark matter partic le s, with a mass resolutio n of 1.227 x lO n M . 
Two comparison papers (IHeitmann et al.l (120051 ) . IHeitmann et al.l (120081 )) demonstrated good 
agreement on several derived quantities among ten different, widely used cosmology codes. 

Here we compare our results to the Gadget-2 simulations as presented in the Heitmann 
et al. papers; we refer interested readers to those papers to see a comparison of the Gadget- 
2 simulations with other codes. We present results from three different Nyx simulations: 
uniform grid simulations at 256 3 and 1024 3 , and a simulation with a 256 3 base grid and two 
levels of refinement, each by a factor of two. In Fig. [2J we show correlation functions (upper 
panel), and ratio to the Gadget-2 results (lower panel), to allow for a closer inspection. We 
see a strong match between the Gadget-2 and Nyx results, and observe convergence of Nyx 
code towards the Gadget-2 results as the resolution increases. In particular, we note that 
the effective 1024 3 results achieved with AMR very closely match the results achieved with 
the uniform 1024 3 grid. 

Next, we examine the mass and spatial distribution of halos, two important statistical 
measures used in different ways throughout cosmology. To generate halo catalogs we use th e 
same halo finder for all runs, which finds friends-of-friends halos (FOF; iDavis et al.l (119851 )) 
using a linking length of b = 0.2. To focus on the differences between codes we consider 
halos with as few as ten particles, although in a real application we would consider only 
halos with at least hundreds of particles in order to avoid large inaccuracies due to the 
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finite sampling of FOF halos. In Figure [31 we show results for the mass function, and we 
confirm good agreement between the high r esoluti on Nyx run and G adget-2. As shown in 
O'Shea et al.l ( 120051 ) , iHeitmann et al.l ( 120051 ) , and iLukic et al.l (120071 ) , common refinement 
strategies suppress the halo mass function at the low-mass end. This is because small 
halos form very early and throughout the whole simulation domain; capturing them requires 
refinement so early and so wide, that block-based refinement (if not AMR in general) gives 
hardly any advantage over a fixed- grid simulation. N yx AMR results shown here are fully 
consistent with the cri teria given in lLukic et al.l ( 120071 ). and are similar to other AMR codes 
dHeitmann et al.ll2008l ). 

In Figure H] we present the correlation function for halos. Rather than looking at all 
halos together, we separate them into three mass bins. For the smallest halos we see the offset 
in the low resolution run, but even that converges quickly, in spite of the fact that the 1024 3 
run still has fewer small-mass halos than the Gadget run done with force resolution several 
times smaller (Fig. [3]). Finally, in Figure [5] we examine the inner structure of halos. We 
observe excellent agreement between the AMR run and the uniform 1024 3 run, confirming 
the expectation that the structure of resolved halos in the AMR simulation match those 
in the fixed-grid runs. Overall, the agreement between the Nyx and Gadget-2 results is as 
expected, and is on a par with other AMR cosmology codes at this resolution as demonstrated 
in the Heitmann et al. papers. 



6.2. Santa Barbara Cluster Simulations 



The Santa Barbara cluster has become a standard code comparison test problem for 
cluster formation in a realistic cosmological setting. To make meaningful comparisons, all 
codes start from the same set of initial conditions and evolve both the gas and dark matter to 
redshift z = assuming an ideal gas equation of state with no heating or cooling ( "adiabatic 
hydro"). Initial conditions are constrained such that a cluster (3 a rare) forms in the central 
part of the box. The simulation domain is 64 Mpc on a side, with the SCDM cosmology: 
Q = Q m = 1, Qb = 0.1, h = 0.5, = 0.9, and n s = 1. A 3D snapshot of the dark matter 
density and gas temperature at z = is shown i n Figure [6j For the Nyx simulation we use 
the exact initial conditions from IHeitmann et al.l (120051 ). and begin the simulation at z = 63 
with 256 3 particles. Detailed r esults from a code comparison study of the Santa Barbara 
test problem were published in iFrenk et al.l (119991 ); we present here not all, but the most 
significant diagnostics from that paper for comparison of Nyx with other available codes. 
We refer the interested reader to that paper for full details of the simulation results for the 
different codes. 
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We first examine the global cluster properties, calculated inside r2oo with respect to the 
critical dens i ty at redshift z = 0. We find the total mass of the cluster to be 1.15 x 1O 15 M ; 
Frenk et al.l (119991 ) report the mean and 1 a deviation as (1.1 ± 0.05 ) x 10 15 M ff for all codes. 
The radius of the cluster from the Nyx run is 2.7 Mpc; iFrenk et al.l (119991 ) report 2.7 ± 0.04. 
The NFW concentration is 7.1; the original paper gives 7.5 as a rough guideline to what the 
concentration should be. 

In Figure [7] we show comparisons of the radial profiles of several quantities: dark matter 
density, gas density, pressure, and entropy. Since the original data from the comparison 
project is not publicly available, we compare on ly with the average val ues, as well as the Enzo 
and ART data as estimated from the figures in Kravtsov et al. (l2002h. We also note that we 



did no t have exactly the same initial conditions as in 
(120021 ) and our starting redshift differs from that of Enzo 
exact agreement. 



Frenk et al 



Jl999h 



or 



Kravtsov et al. 



30), so we do not expect 



In the original code comparison, the definition of the cluster center was left to the 
discretion of each simulator, and here we adopt the gravitational potential minimum to 
define the cluster center. As in lFrenk et al.l (Il999l ). we radially bin quantities in 15 spherical 
shells, evenly spaced in log radius, and covering 3 orders of magnitude - from 10 kpc to 10 
Mpc. The pressure reported is p = p gas T, and entropy is defined as s = ln(T / (?Jas) ■ We find 
good agreement between Nyx and the other AMR codes, including the well-known entropy 
flattening in the central part of the cluster. In sharp contras t to grid methods, SP H codes 
find central entropy to continue rising towards smaller radii. iMitchell et al.l (120091 ) made a 
detailed investigation of this issue, and have found that the difference between the two is 
independent of mesh sizes in grid codes, or particle number in SPH. Instead, the difference 
is fundamental in nature, and is shown to origi nate a s a con sequence of the suppression of 
eddies and fluid instabilities in SPH. (See also lAbell (120111 ) for an SPH formulation that 
improves mixing at a cost of violating exact momentum conservation.) Overall, we find 
excellent agreement of Nyx with existing AMR codes. 



7. Conclusions and Future Work 

We have presented a new N-body and gas dynamics code, Nyx, for large-scale cosmolog- 
ical simulations. Nyx is designed to efficiently utilize tens of thousands of processors; timings 
of the code up to almost 50,000 processors show excellent weak scaling behavior. Validation 
of Nyx in pure dark matter runs and dark matter with adiabatic hydrodynamics has been 
presented. Future papers will give greater detail on the implementation in Nyx of source 
terms, and will present results from simulations incorporating different heating and cooling 



-27- 



mechanisms of the gas, as needed for increased fidelity in different applications. Scientific 
studies already underway with Nyx include studies of the Lyman-a forest and galaxy clus- 
ters. In addition, we plan to extend Nyx to allow for simulation of alternative cosmological 
models to ACDM, most interestingly dynamical dark energy and modifications of Einstein's 
gravity. The existing grid structure and multigrid Poisson solver should make it straightfor- 
ward to extend the current capability to iterative methods for solving the non-linear elliptic 
equations arising in those models. 
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Fig. 2. — Two-point correlation function of dark matter particles in 256 Mpc/h box (upper 
panel), and the ratio with respect to the Gadget-2 results (lower panel). We show Nyx 
results from two uniform-grid simulations and one AMR simulation with a 256 3 base grid 
and two levels of refinement, each by a factor of two. 
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Fig. 3. — Halo mass function also shows good convergence with th e increasing force resolu tion 
towards Gadget-2 results, run at higher resolution. Solid line is fjSheth fc Tormen!ll999l ) fit, 
shown here purely to guide the eye; ratios are taken with respect to Gadget as in the other 
plots. 
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Fig. 4. — Two-point correlation function of dark matter halos, separated in three mass 
groups. We see good agreement between the two codes, as well as rapid convergence even for 
halos sampled with a few tens of particles. We emphasize that halos in the range of 10-100 
particles would not have been given serious consideration in a real application, but we show 
them here solely to examine differences between codes. Note that x-axis is different in all 3 
panels. 
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Fig. 5. — Density profiles for the three most massive objects in the Gadget-2 run. Lines are 
the best fit NFW profile to the Gadget-2 run. We can see that the profiles from the AMR 
simulation closely match those of the uniform 1024 3 simulation. 



Fig. 6. — Santa Barbara cluster simulation. We show 3 orthogonal cuts through the box cen- 
ter showing dark matter density in blue-white colors. Superposed in red are iso-temperature 
surfaces of 10 7 K gas showing rich structure of intergalactic gas throughout the simulations 
domain. 
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Fig. 7. — Left: density of dark matter (upper panel) and gas (lower panel); right: pressure 
(upper) and entropy (lower) profiles for the Santa Barbara cluster. Besides Nyx, we show 
the average from the original Frenk et al. 1999 paper, and results from two AMR codes, 
Enzo and ART. 



